Downloading microbiome sequences from SRA
Brief Intro
Microbiome sequencing reads may be obtained from different sources. The most common ones include:
- Reads directly from a sequencing platform.
- Reads downloaded from the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
- Reads synthesized using sequencing simulators.
Snakemake workflow rules
Tentative snakemake workflow
Setting up SRA Toolkit
Quick glimpse
The NCBI Sequence Read Archive (SRA) stores sequencing data from next-generation sequencing platforms. Users can download data from the SRA using the SRA Toolkit or custom computational methods.
Demo: installing SRA Toolkit on macOS
Download sratoolkit
- Navigate to where you want to install the tools, preferably the home directory.
- For more information click here.
curl -LO https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-mac64.tar.gz
tar -xf sratoolkit.3.0.0-mac64.tar.gz
export PATH=$HOME/sratoolkit.3.0.0-mac64/bin/:$PATH
Create a cache root directory
mkdir -p ~/ncbi
echo '/repository/user/main/public/root = "cache_directory"' > ~/ncbi/user-settings.mkfg
Confirm sra toolkit configuration
- The vdb-config -i command below will display a blue dialog.
- Use Tab or press c to navigate to the cache tab.
- Review the configuration, then save (s) and exit (x).
vdb-config -i
A screenshot of the SRA configuration.
Using already installed sratools
We can create an environment and install the essential tools (refer to IMAP-PART 01).
name: sradb
channels:
- conda-forge
- bioconda
dependencies:
- snakemake =7.19.1
- snakemake-minimal =7.19.1
- snakedeploy =0.8.6
- sra-tools
- entrez-direct
- pysradb
- insilicoseq =1.5.4
- seqkit =2.3.1
mamba env create --name sradb --file environment.yml
Downloading multiple fastq files
Using fasterq-dump
- Be sure that fasterq-dump is in the path.
- Type which fasterq-dump or fasterq-dump --help to confirm.
- Specify the output and temporary directories.
- It is possible to specify a range of SRA accessions to use in a for loop.
Example code for downloading reads for SRA accessions ranging from SRR7450706 to SRR7450761
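The PATH check described above can be wrapped in a small helper so that a download script reports a missing tool with a clear message. This is a minimal sketch: the function name require_tool is hypothetical and is not part of SRA Toolkit.

```shell
# Hypothetical helper (not part of SRA Toolkit): report whether a tool is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1 (add its bin directory to PATH)" >&2
        return 1
    fi
}

# Check the tools used in this section without aborting on the first failure.
for tool in fasterq-dump seqkit; do
    require_tool "$tool" || true
done
```

In a real script, the `|| true` could be replaced with `|| exit 1` so that the run stops before any download starts.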
for (( i = 706; i <= 761; i++ ))
do
  time fasterq-dump "SRR7450${i}" \
    --split-3 \
    --force \
    --skip-technical \
    --outdir data/reads \
    --temp data/temp \
    --threads 4
done
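Instead of hard-coding the numeric range inside the loop, the accession list can be generated first and then read back line by line. The sketch below only builds the list. The function name make_accessions is hypothetical, and the file name results/run_accessions.txt in the comment is borrowed from the project tree in the Appendix as an assumed location.

```shell
# Hypothetical helper: print one accession per line for a numeric range.
# Usage: make_accessions PREFIX FIRST LAST
make_accessions() {
    i="$2"
    while [ "$i" -le "$3" ]; do
        printf '%s%s\n' "$1" "$i"
        i=$((i + 1))
    done
}

# Build the list for SRR7450706 to SRR7450761 and show a short summary.
accessions="$(make_accessions SRR7450 706 761)"
echo "$accessions" | head -n 2
echo "total: $(echo "$accessions" | wc -l | tr -d ' ')"

# The list could then drive the download loop, for example:
#   make_accessions SRR7450 706 761 > results/run_accessions.txt
#   while read -r acc; do
#       fasterq-dump "$acc" --outdir data/reads --temp data/temp
#   done < results/run_accessions.txt
```

Keeping the list in a file makes it easy to rerun the download for a subset of accessions without editing the loop.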
Compressing and uncompressing files
Microbiome FASTQ files are usually very large, so compressing them can save a lot of disk space.
Example syntax
# Uncompress all gzipped files in data/reads
gunzip data/reads/*.gz
# Compress all FASTQ files in data/reads
gzip data/reads/*.fastq
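When compressing many files, it can help to test each archive right after it is written. The following is a minimal sketch using only standard gzip options (-f to overwrite, -t to test integrity). The function name compress_fastq_dir and the demo file are hypothetical.

```shell
# Hypothetical helper: gzip every .fastq file in a directory, then test each archive.
compress_fastq_dir() {
    dir="$1"
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue      # skip when the glob matches nothing
        gzip -f "$f"                 # replaces sample.fastq with sample.fastq.gz
        gzip -t "$f.gz" || return 1  # integrity check of the compressed file
        echo "compressed: $f.gz"
    done
}

# Demo on a throwaway directory with one tiny FASTQ record.
demo_dir="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' > "$demo_dir/sample.fastq"
compress_fastq_dir "$demo_dir"
```

For the layout used in this section, the call would be compress_fastq_dir data/reads.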
How to resize Fastq files
Purpose
- Sometimes we want to extract a small subset to test the bioinformatics pipeline.
- You can resize the fastq files using the seqkit sample function [seqkit2022].
- Below is a quick demo for extracting only 1% of the paired-end metagenomics sequencing data.
Example
This example extracts 1% of the reads from only two samples (SRR10245277 and SRR10245278).
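mkdir -p data
for i in {77..78}
do
  # Sample R1 and R2 in separate pipelines; using the same seeds (-s) for both
  # mates keeps the sampling reproducible and selects matching reads from each file.
  seqkit sample -p 0.01 -s 11 SRR102452${i}_R1.fastq \
    | seqkit shuffle -s 23 -o data/SRR102452${i}_R1_sub.fastq
  seqkit sample -p 0.01 -s 11 SRR102452${i}_R2.fastq \
    | seqkit shuffle -s 23 -o data/SRR102452${i}_R2_sub.fastq
done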
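After subsampling paired-end files, it can be useful to confirm that both mates still contain the same number of reads. The sketch below assumes uncompressed FASTQ files with exactly four lines per record. The function names count_fastq_reads and check_pairs, and the demo files, are hypothetical.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file (4 lines per record).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: succeed only if both mates contain the same number of reads.
check_pairs() {
    n1="$(count_fastq_reads "$1")"
    n2="$(count_fastq_reads "$2")"
    echo "R1: $n1 reads, R2: $n2 reads"
    [ "$n1" = "$n2" ]
}

# Demo with two tiny FASTQ files, two reads each.
tmp="$(mktemp -d)"
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$tmp/demo_R1.fastq"
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' > "$tmp/demo_R2.fastq"
check_pairs "$tmp/demo_R1.fastq" "$tmp/demo_R2.fastq"
```

A matching count does not prove that the read names line up, but a mismatch is a quick signal that the two files were not sampled consistently.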
References
Appendix
Project main tree
.
├── LICENSE
├── README.md
├── config
│  ├── config.yaml
│  ├── samples.tsv
│  └── units.tsv
├── dags
│  ├── rulegraph.png
│  └── rulegraph.svg
├── data
│  ├── metadata
│  ├── reads
│  ├── temp
│  └── test
├── docs
│  └── env_spec_file.txt
├── images
│  ├── smkreport
│  ├── sra.png
│  └── sra_config_cache.png
├── index.Rmd
├── library
│  ├── apa.csl
│  ├── imap.bib
│  └── references.bib
├── report.html
├── results
│  ├── project_tree.txt
│  └── run_accessions.txt
├── styles.css
└── workflow
   ├── Snakefile
   ├── envs
   ├── rules
   ├── schemas
   └── scripts
17 directories, 19 files
Screenshot of interactive snakemake report
The interactive snakemake HTML report can be viewed by opening report.html in any compatible browser. You will be able to explore the workflow and the associated statistics. You can close the left bar to get a more expansive display view.